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Abstract. The valuation of real estates is of extreme importance for decision mak- 
ing. Their singular characteristics make valuation through hedonic pricing methods 
difficult since the theory does not specify the correct regression functional form nor 
which explanatory variables should be included in the hedonic equation. In this arti- 
cle we perform real estate appraisal using a class of regression models proposed by 
Rigby & Stasinopoulos (2005): generalized additive models for location, scale and 
shape (GAMLSS). Our empirical analysis shows that these models seem to be more 
appropriate for estimation of the hedonic prices function than the regression models 
currently used to that end. 



1. Introduction 

The real estate, apart from being a consumer good that provides comfort and social 
status, is one of the economic pillars of all modern societies. It has become a form 
of stock capital, given the expectations of increasing prices, and a means of obtaining 
financial gains through rental revenues and sale profits. As a consequence, the real 
estate market value has become a parameter of extreme importance. 

The estimation of a real estate value is usually done using a hedonic pricing equation 
according to the methodology proposed by Rosen (1974). It is seen as a heterogeneous 
good comprised of a set of characteristics and it is then important to estimate an explicit 
function, called hedonic price function, that determines which are the most influential 
attributes, or attribute 'package', when it comes to determing its price. However, the 
estimation of a hedonic equation is not a trivial task since the theory does not determine 
the exact functional form nor the relevant conditioning variables. 

The use of classical regression methodologies, such as the classical normal linear 
regression model (CNLRM), for real estate appraisal can lead to biased, inefficient 
and/or inconsistent estimates given the inherent characteristics of the data (e.g., non- 
normality, heteroskedasticity and spatial correlation). The use of generalized linear 
models (GLM) is also subject to shortcomings, since the data may come from a dis- 
tribution outside the exponential family and the functional relationship between the 
response and some conditioning variables may not be the same for all observations. 
There are semiparametric and nonparametric hedonic price estimations available in 
the literature, such as Pace (1993), Anglin & Gencay (1996), Gencay & Yang (1996), 
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Thorsnes & McMillen (1998), Iwata et al. (2000) and Clapp et al. (2002). We also 
highlight the work of Martins-Filho & Bin (2005), who modeled data from the real 
estate market in Multnomah County (Oregon-USA) nonparametrically. We note that 
the use of nonparametric estimation strategies require very large datasets in order to 
avoid the 'curse of dimensionality'. Overall, however, most hedonic price estimations 
are based on traditional methodologies such as the classical linear regression model 
and the class of generalized linear models. 

This article proposes a methodology for real estate mass appraisaQ based on the 
class of GAMLSS models. The superiority GAMLSS modeling relative to traditional 
methodologies is evidenced by an empirical analysis that employs data on urban land 
lots located in the city of Aracaju, Brazil. We perform a real estate evaluation in which 
the response variable is the unit price of land lots and the independent variables are 
reflect the land lots structural, locational and economic characteristics. We estimate 
the location and scale effects semiparametrically in such a way that some covariates 
(the geographical coordinates of the land lot, for instance) enter the predictor non- 
parametrically and their effects are estimated using smoothing splines^] whereas other 
regressors are included in the predictor in the usual parametric fashion. The model 
delivers a fit that is clearly superior to those obtained using the usual approaches. In 
particular, we note that our fit yields a very high pseudo-/? 2 . 

The paper unfolds as follows. In Section 2, we briefly present the class of GAMLSS 
models and highlight its main advantages. In Section 3, we describe the data used 
in the empirical analysis. In Section 4, we present and discuss the empirical results. 
Finally, Section 5 closes the paper with some concluding remarks. 

2. GAMLSS MODELING 

2.1. Definition. Rigby & Stasinopoulos (2005) introduced a general class of statisti- 
cal models called 'additive models for location, scale and shape' (GAMLSS). It en- 
compasses both parametric and semiparametric models, and includes a wide range of 
continuous and discrete distributions for the response variable. It also allows the sy- 
multaneous modeling of several parameters that index the response distribution using 
parametric and/or nonparametric functions. With GAMLSS models, the distribution of 
the response variable is not restricted to the exponential family and different additive 
terms can be included in the regression predictors for the parameters that index the dis- 
tribution, like smoothing splines and random effects, which yields extra flexibility to 
the model. The model is parametric in the sense that the specification of a distribution 
for the response variable is required and at the same time it is semiparametric because 
one can model some conditioning effects through nonparametric functions. 

The probability density function of the response variable y shall be denoted as f(y\0), 
where = (61,62, ... , 9 P ) T is a ^-dimensional parameter vector. It is assumed to belong 
to a wide class of distributions that we denote by D. This class of distributions includes 
continuous and discrete distributions as well as truncated, censored and finite mixtures 

'Evaluation of a set of real properties through methodology and procedures common to all of them. 
2 For more details on smoothing splines, see Silverman (1984) and Eubank (1999). 
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of distributions. In the GAMLSS regression framework the p parameters that index 
f(y\9) are modeled using additive terms. 

Let y = (yi, y 2 , . . . ,y n ) T be the vector of independent observations on the response 
variable, each y ; - having probability density function /(y,|0 l )> i = l,...,n. Here, 6' = 
(0/i > 0/2> • • ■ > 0/p) T is a vector of p parameters associated to the explanatory variables and 
to random effects. When the covariates are stochastic, f(y0) is taken to be conditional 
on their values. Additionally, for k = 1,2, ... ,p, g k (-) is a strictly monotonic link 
function that relates the kth parameter 6 k to explanatory variables and random effects 
through an additive predictor: 

/* 

g k (0 k ) = m = X k fi k + J] Zjk7jk, (2-1) 

7=1 

where 9 k and r\ k are n x 1 vectors, J3 k = (fi\ k ,f3 2k , ■ ■ ■ ,f3j'k) T is a vector of parameters 
of length J' k and X k and Zj k are fixed (covariate) design matrixes of orders nx J' k and 
n x qj k , respectively. Finally, y , k is a ^-dimensional random variable. Model (2.1 ) is 
called GAMLSS (Rigby & Stasinopoulos, 2005). 

In many practical situations it suffices to model four parameters (p = 4), usually 
location (6i = fi), scale (9 2 = cr), skewness (63 = v) and kurtosis (64 = r); the latter 
two are said to be shape parameters. We thus have the following model: 
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(2.2) 



It is also possible to add to the predictor functions hj k that involve smoothers like cu- 
bic splines, penalized splines, fractional polynomials, loess curves, terms of variable 
coefficients, and others. Any combination of these functions can be included in the 
submodels for p, cr, v and r. As Akantziliotou et al. (2002) point out, the GAMLSS 
framework can be applied to the parameters of any population distribution and gener- 
alized to allow the modeling of more than four parameters. 

GAMLSS models can be estimated using the gamlss package for R (Ihaka & 



Gentleman, 1996; Cribari-Neto & Zarkos, 1999), which is free software; see http: 



//www . R-pro j ect . org Practitioners can then choose from more than 50 response 
distributions. 

2.2. Estimation. Two aspects are central to the GAMLSS additive components fit- 
ting, namely: the backfitting algorithm and the fact that quadratic penalties in the 
likelihood function follow from the assumption that all random effects in the linear 
predictor are normally distributed. 
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Suppose that the random effects y jk in Model (2.1 ) are independent and normally 
distributed with y jk ~ N qjk (0, G~ k ), where G* 1 is the q ]k x q jk (generalized) inverse 
of the symmetric matrix Gj k = Gjk{Aj k ). Rigby & Stasinopoulos (2005) note that for 
fixed values of A jk , one can estimate J3 k and y jk by maximizing the following penalized 
log-likelihood function: 



k=l 



where £ = Yll=\ ^°g,{f(y0)} is the log-likelihood function of the data given 6\ for 
i = 1, 2, . . . , n. This can be accomplished by using a backfitting algorithm^] 

2.3. Model selection and diagnostic. GAMLSS model selection is performed by 
comparing various competing models in which different combinations of the com- 
ponents AX = {D, Q, T, A} are used, where D specifies the distribution of the response 
variable, Q is the set of link functions (gi, . . . , g p ) for the parameters (#i , . . . , 6 P ), T de- 
fines the set of predictor terms (t\, . . . , t p ) for the predictors (77^ ... , rj p ) and A specifies 
the set of hiperparameters. 

In the parametric GAMLSS regression setting, each nested model M can be as- 
sessed from its fitted global deviance (GD), given by GD = -2£(6), where £(0) = 
H" = i Two nested competing GAMLSS models Mq and Mu with fitted global 

deviances GD and GDi and error degrees of freedom df e Q and df e \, respectively, can 
be compared using the generalized likelihood ratio test statistic A = GD -GD!, which 
is asymptotically distributed as^- 2 with d = df e0 - df el degrees of freedom under Mq. 
For each model M the number of error degrees of freedom df e is df e = n - YJl=\ dfok, 
where df ek are the degrees of freedom that are used in the predictor of the model for 
the parameter 6 k ,k= 1, . . . , p. 

When comparing non-nested GAMLSS models (including models with smoothing 
terms), the generalized Akaike information criterion (GAIC; Akaike, 1983) can be 
used to penalize overfittings. That is achieved by adding to the fitted global deviances 
a fixed penalty # for each effective degree of freedom that is used in the model, that 
is, GAIC(#) = GD + #df, where df denotes the total effective number of degrees of 
freedom that are used in the model and GD is the fitted global deviance. One then 
selects the model with the smallest GAIC(#) value. 

To assess the overall adequacy of the fitted model, we propose the randomized 
quantile residual (Dunn & Smyth, 1996). It is a randomized version of the Cox & 
Snell (1968) residual and given by 

rf = 0" 1 (M I ), i = l,...,n, 

where O(-) denotes the standard normal distribution function, u t is a uniform random 
variable on the interval (a,-,&,-], with a,- = limy^ F (y,-|0') and b t = F(yi\0 l ). A plot 
of these residuals against the index of the observations (z) should show no detectable 



3 For details, see Rigby & Stasinopoulos (2005, 2007), Hastie & Tibshirani (1990) and Hiirdle et 
al. (2004). 
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pattern. A detectable trend in the plot of some residual against the predictors may be 
suggestive of link function misspecification. 

Also, normal probability plots with simulated envelopes (Atkinson, 1985) or Worm 
plots (Buuren & Fredriks, 2001) are a helpful diagnostic tool. The Worm plots were 
first introduced by Buuren & Fredriks (2001) and are useful for analyzing the residuals 
in different regions (intervals) of the explanatory variable. If no explanatory variable 
is specified, the worm plot becomes a detrended normal QQ plot of the (normalized 
quantile) residuals. When all points lie inside the (dotted) confidence bands (the two 
elliptical curves) there is no evidence of model mispecification. 

In the context of a fully parametric model GAMLSS we can use pseudo R 2 measures. 
For example, R z * = 1 - log L/ log L (McFadden, 1974) and # 2 R = 1 - (L /L) 2/ " (Cox 

& Snell, 1989, pp. 208-209), where L and L are the maximized likelihood functions 
of the null (intercept only) and fitted (unrestricted) models, respectively. The ratio of 
the likelihoods or log-likelihoods may be regarded as a measure of the improvement, 
over the model with 6 l parameters achieved by the model under investigation. 

Our proposal, however, is to compare the different models using the pseudo-/? 2 given 
by the square of the sample correlation coefficient between the response and the fitted 
values. Notice that by doing so we can consider both fully parametric models and 
models that include nonparametric components. We can also compare the explanatory 
power of a GAMLSS model to those of GLM and CNLRM models. This is the pseudo- 
R 2 we shall use. It was introduced by Ferrari and Cribari-Neto (2004) in the context 
of beta regressions and it is a straightfoward generalization of the R 2 measure used in 
linear regression analysis. 



3. Data description 

The data contain 2,109 observations on empty urban land lots located in the city of 
Aracaju, capital of the state of Sergipe (SE), Brazil, and comes from two sources: (i) 
data collected by the authors from real estate agencies, advertisements on newspapers 
and research on location (land lots for sale or already sold); (ii) data obtained from 
the 'Departamento de Cadastro Imobiliario da Prefeitura de Aracaju'. Observations 
cover the following years: 2005, 2006 and 2007. Each land lot price was recorded 
only once during that period. It is also noteworthy that the land lots in the sample 
are geographically referenced relative to the South American Daturrj^] and have their 
geographical positions (latitude, longitude) projected onto the Universal Transverse 
Mercator (UTM) coordinate systemjj 

The sample used to estimate the hedonic prices equation^] contains, besides the year 
of reference, information on the physical (area, front, topography, pavement and block 

4 The South American Datum (SAD) is the regional geodesic system for South America and refers to 
the mathematical representation of the Earth surface at sea level. 

5 Cilindrical cartographic projection of the terrestrial spheroid in 60 secant cylinders at Earth level 
alongside the meridians in multiple zones of 6 degrees longitude and stretching out 80 degrees South 
latitude to 84 degrees North latitude. 

6 That is, the equation of hedonic prices of urban land lots in Aracaju-SE. 
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position), locational (neighborhood, geographical coordinates, utilization coefficient 
and type of street in which the land lot is located) and economic (nature of the infor- 
mation that generated the observation, average income of the head of household of the 
censitary system where the land is located and the land lot price) characteristics of the 
land lots. In particular, we shall use the following variables: 

• YEAR (YR): qualitative ordinal variable that identifies the year in which the 
information was obtained. It assumes the values 2005, 2006 (YR06) and 2007 
(YR07). It enters the model through dummy variables; 

• AREA (AR): continuous quantitative variable, measured in m 2 (square meters), 
relative to the projection on a horizontal plane of the land surface; 

• FRONT (FR): continuous quantitative variable, measured in m (meters), concer- 
ning the projection of the land lot front over a line which is perpendicular to 
one of the lot boundaries, when both are oblique in the same sense, or to the 
'chord', in the case of curved fronts; 

• TOPOGRAPHY (TO): nominal qualitative variable that relates to the topograph- 
ical conformations of the land lot. It is classified as 'plain' when the land 
acclivity is smaller than 10% or its declivity is smaller than 5%, and as 'rough' 
otherwise. It is a dummy variable that equals 1 for 'plain' and 'rough'; 

• PAVEMENT (PA): nominal qualitative variable that indicates the presence or ab- 
sence of pavement (concrete, asphalt, among others) on the street in which the 
main land lot front is located. It enters the model as a dummy variable that 
equals 1 when the land lot is located on a paved street and otherwise; 

• SITUATION (SI): nominal qualitative variable used to differentiate the disposi- 
tion of the land lot on the block. It is classified as 'corner lot' or 'middle lot'. 
It is a dummy variable that assumes value 1 for corner lots and for all other 
land lots; 

• NEIGHBORHOOD (NB): nominal qualitative variable referring to the name of 
the neighborhood where the land lot is located. It was categorized as valu- 
able (highly priced) neighborhoods and other neighborhoods, with the variable 
shown as VN and regarded as a dummy (1 for valuable neighborhoods). The 
neighborhoods were also grouped as belonging or not belonging to the city 
South Zone, dummy denoted by SZ (1 for South Zone); 

• LATITUDE (LAT) and LONGITUDE (LON): continuous quantitative variables cor- 
responding to the geographical position of the land lot at the point z = (LAT, 
LON), where LAT and LON are the coordinates measured in UTM; 

• UTILIZATION COEFFICIENT (UC): discrete variable given by a number that, when 
multiplied by the area of the land lot, yields the maximal area (in square me- 
ters) available for construction. UC is defined in an official urban development 
document. It assumes the following values: 3.0, 3.5, . . . , 5.5, 6.0; 

• STREET (STR): ordinal qualitative variable used to differentiate the land lot lo- 
cation relative to streets and avenues. It is classified as 'minor arterial' (STR1), 
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'collector street' (STR2) and 'local street' according to the importance of the 
street where the land lot is located. It enters the model as dummy variables; 

• NATURE OF THE INFORMATION (NI): nominal qualitative variable that indicates 
whether the observation is derived from 'offer', 'transaction' or from the Ara- 
caju register office (real state sale taxes). It enters the model through dummy 
variables; 

• SECTOR (ST): discrete quantitative proxy variable of macrolocation used to so- 
cioeconomic ally distinguish the various neighborhoods, represented by the av- 
erage income of the head of household, in minimum wages, according to the 
IBGE census (2000). The neighborhood average income functions as a proxy 
to other characteristics, such as urban amenities. It assumes the following val- 
ues: 1,2, ... , 18; 

• FRONT IN HIGHLY VALUED NEIGHBORHOODS (FRVN): continuous quantitative 
variable that assumes strictly positive values and corresponds to the interac- 
tion between FR and VN variables. It is included in the model to capture the 
influence of land lots front dimensions in 'valuable' neighborhoods; 

• UNIT PRICE (UP): continuous quantitative variable that assumes strictly positive 
values and corresponds to the land lot price divided by its area, measured in 
R$/m 2 (reais per square meter). 

In real estate appraisals (and specifically in land lots valuations), the interest typi- 
cally lies in modeling the unit price as a function of the underlying structural, locational 
and economic characteristics of the real estate. We shall then use UP as the dependent 
variable (response). The independent variables relate to the locational (NB, VN, SZ, 
LAT, LON, ST, UC and STR), physical (AR, FR, TO, SI and FRVN) and economic (NI) land 
lot characteristics; we also account for the year of data collection. 

Figure [TJpresents box-plots of UP, AR and FR and Table[T]displays summary statistics 
on those variables. The box-plot of UP shows that its distribution is skewed and that 
there are several extreme observations. Notice from Table [T] that the sample values of 
UP range from R$ 2.36/m 2 to R$ 800.00/m 2 and that 75% of the land lots have unit 
prices smaller than R$ 82.82/m 2 . 

We note that 263 extreme observations have been identified from the box-plot of AR 
(see Figure [T]). These observations are not in error, they appear as outlying data points 
in the plot because the variable assumes a quite wide range of values: from 41 m 2 to 
91,780 m 2 , that is, the largest land lot is nearly two thousand times larger than the 
smallest one. 



Table 1. Descriptive statistics. 



Variable 


Mean 


Median 


Standard error 


Minimum 


Maximum 


Range 


UP 


72.82 


55.56 


70.28 


2.36 


800.00 


797.64 


LAT 


710100.00 


710300.00 


2722.34 


701500.00 


714600.00 


13100.00 


LON 


8787000.00 


8786000.00 


6638.77 


8769000.00 


8798000.00 


29000.00 


AR 


1355.00 


300.00 


6063.53 


48.00 


91780.00 


91732.00 


FR 


18.13 


10.00 


30.54 


2.60 


516.00 


513.40 
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UP AR FR 

Figure 1 . Box-plots of UP, AR and FR. 

In order to investigate how UP relates to some explanatory variables, we produced 
dispersion plots. Figure [2] contains the following pairwise plots: (i) UP x LAT; (ii) UP 
x LON; (iii) log(UP) x log(AR); (iv) log(UP) x log(FR); (v) UP x ST and (vi) UP x UC. 
It shows that there is a direct relationship between UP and the corresponding regressor 
in (i), (ii), (v) and (vi), whereas in (iii) and (iv) the relationship is inverse. Thus, there 
is a tendency for the land lot unit price to increase with latitude, longitude, sector 
and also with the utilization coefficient, and to decrease as the area and the front size 
increase. We note that the inverse relationship between unit price and front size was 
not expected. It motivated the inclusion of the covariate FRVN in our analysis. 

It is not clear from Figure [2] whether the usual assumptions of normality and ho- 
mo skedasticity are reasonable. As noted by Rigby & Stasinopoulos (2007), transfor- 
mations of the response variable and/or of the explanatory variables are usually made 
in order to minimize deviations from the underlying assumptions. However, this prac- 
tice may not deliver the expected results. Additionally, the resulting model parameters 
are not typically easily interpretable in terms of the untransformed variables. A more 
general modeling strategy is thus called for. 

4. Empirical modeling 

In what follows we shall estimate the hedonic price function of land lots located in 
Aracaju using the highly flexible class of GAMLSS models. At the outset, however, 
we shall estimate standard linear regression and generalized linear models. We shall 
use these fits as benchmarks for our estimated GAMLSS hedonic price function. 

4.1. Data modeling based on the CNLRM. Table [2] lists the classical normal linear 
regressions that were estimated. The transformation parameter of the Box-Cox model 
was estimated by maximizing the profile log-likelihood function: A = 0.1010. All 
four models are heteroskedastic and there is strong evidence of nonnormality for the 
first two models. The coefficients of determination range from 0.54 to 0.66. Since 
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(i) UP x LAT (ii) UP x LON (Mi) log(UP) x log(AR) 





Figure 2. Dispersion plots. 

the error variances are not constant, we present in Table [3] the estimated parameters of 
Model (1.4), which yields the best fit, along with heteroskedasticity-robust HC3 stan- 
dard errors (Davidson & MacKinnon, 1993). Notice that all covariates are statistically 
significant at the 5% nominal level, except for LAT (p-value = 0.1263), which suggests 
that pricing differentiation mostly takes place as we move in the North-South direction. 

4.2. Hedonic GLM function. Table [4] displays the maximum likelihood fit of the 
following generalized linear model: 

g(UP*) = A) + >S 2 LON + yS 3 log(AR) + /? 4 UC + j8 5 log(ST) + jS 6 STRl + ^ 7 STR2 
+ j0 8 SI + #,PA + {3 w TO + AiNIO + y8i 2 NIT + /? 13 YR06 + ^ 14 YR07 
+ j0 15 SZ + ySi 6 log(FRVN), (Model 2.1) 

where UP* = JE(UP) = /i, UP ~ gamma (/i, ex) and 77 = log(/i). We have tried a number 
of different models, and this one (gamma response and log link) yielded the best fit. 
We also note that all regressors are statistically significant at the 1% nominal level, 
except for LAT (p-value = 0.5295), which is why we dropped this covariate from the 
model. 

4.3. GAMLSS hedonic fit. 

4.3.1. Location parameter modeling (fi). Since UP (the response) only assumes posi- 
tive values, we have considered the following distributions for it: log-normal (LOGNO), 
inverse Gaussian (IG), Weibull (WEI) and gamma (GA). As noted earlier, we use 
pseudo-/? 2 given by 

pseudo-i? 2 = [correlation (observed values of UP, predicted values of UP)] 2 (4.1) 
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Model Equation 



Considerations 



1.1 UP = y6 + ySiLAT + /? 2 LON + /3 3 AR + /? 4 UC + 
& ST +/3 6 STR 1 +/? 7 STR2 +/3 8 SI +/3 9 PA +/? 10 TO + 
AiNIO+/? 1 2NIT+/3 13 YR06+/3 1 4YR07+ J 8 15 SZ + 
/3 16 FRVN + e 



The null hypotheses that the errors are ho- 
moskedastic and normal are rejected at the 1% 
nominal level by the Breusch-Pagan and Jarque- 
Bera tests, respectively. The explanatory vari- 
ables proved to be statistically significant at the 
1% nominal level (z-tests). Also, 1? = 0.539, 
AIC = 22304 and BIC = 22406. 



1.2 log(UP) =/3o+/3lLAT+/? 2 LON+ y e 3 AR+ y e 4 UC + 
y e5 ST +/3 6 STR 1 +fc STR2 +/3 8 SI +/3 9 PA +/? 10 TO + 
/?i l NIO+/?i 2 NIT+ y S 13 YR06+/3 14 YR07+j8i 5 SZ + 
/3 16 FRVN + e 



The null hypotheses that the errors are ho- 
moskedastic and normal are rejected at the 1% 
nominal level by the Breusch-Pagan and Jarque- 
Bera tests, respectively. All explanatory vari- 
ables proved to be statistically significant at 1% 

2 

the nominal level (z-tests). Also, R = 0.599, 
AIC = 2912 and BIC = 3014. 



1.3 log(UP) = /3 + jSiLAT + /3 2 LON + /? 3 log(AR) + 
/? 4 UC + /? 5 log(ST) + &STR1 + /3 7 STR2 + fcSI + 
faPA + /?i TO + /3 u NIO + /3i 2 NIT + /? 13 YR06 + 
/?i 4 YR07 + y3 15 SZ + y3 16 log(FRVN) + e 



1.4 iJ£_d = p Q + /3iLAT + /? 2 LON + /3 3 log(AR) + 
/J 4 UC + /3 5 log(ST) + /3 6 STR1 + /3 7 STR2 + yS 8 PA + 
#,TO + ySioNIO i NIT + j3 l2 YR06 + /?, 3 YR07 + 
/3 14 log(FRVN) + 6 



The Jarque-Bera test does not reject the null hy- 
pothesis of normality at the usual nominal levels, 
but the Breusch-Pagan test rejects the null hypoth- 
esis of homoskedasticity at the 1% nominal level. 
All explanatory variables are statistically signifi- 
cant at the 1% nominal level, except for the LAT 

variable (p-value = 0.0190). Also, ~R = 0.651, 
AIC = 2619 and BIC = 2721. 
Normality is not rejected by the Jarque-Bera test, 
but the Breusch-Pagan test rejects the null hypoth- 
esis of homoskedasticity at the 1% nominal level. 
All covariates proved to be statistically signifi- 
cant at the 1% nominal level, except for the LAT 
variable (p-value = 0.0881). Also, R 2 = 0.657, 
AIC = 4290 and BIC = 4392. 



to measure the overall goodness-of-fit. 

The models listed in Table [5] include smoothing cubic splines (cs) with 3 effective 
degrees of freedom for the covariates LAT, LON, log(AR), UC, ST and log(FRVN). Other 
smoothers (such as loess and penalized splines), as well as different combinations 
of D (such as BCPE, BCCG, LNO, BCT, exGAUSS, among others; see Rigby and 
Stasinopoulos, 2007) and Q (such as identity, inverse, reciprocal, among others), were 
considered. However they did not yield superior fits. We also note that Model (3.4) 
yields the smallest values of the three model selection criteria. Table [6] contains the a 
summary of the model fit. 

The use of three effective degrees of freedom in the smoothing functions delivered a 
good model fit. However, in order to determine whether a different number of effective 
degrees of freedom delivers superior fit, we used two criteria, namely: the AIC (ob- 
jective) and visual inspection of the smoothed curves (subjective); visual inspection 
aimed at avoiding overfitting. We then arrived at Model (3.5). It also uses cubic spline 
smoothing (cs), but with a different number of effective degrees of freedom (df) in the 
smoothing functions; see Table |7J Notice that there was a considerable reduction - 
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Table 3. Hedonic price function estimated via CNLRM - Model (1.4). 





Estimate 


Standard error 


z-statistic 


v- 


value 


(Intercept) 


-162.6307 


34.1920 


-4 


.756 


0. 


.0000 


LAT 


1.85e-05 


1.21e-05 


1 


.529 


0. 


.1263 


LON 


1.74e-05 


4.60e-06 


3 


.798 


0. 


.0001 


log(AR) 


-0.3507 


0.0192 


-18 


.236 


0. 


0000 


log(ST) 


0.4423 


0.0332 


13 


.297 


0. 


.0000 


UC 


0.2651 


0.0412 


O 


/ion 


0. 


.0000 


CTD 1 




U.U / 1 / 


6 


.789 


u. 


nnnn 
.uuuu 


STR2 


0.1678 


0.0675 


2 


.485 


0. 


.0130 


SI 


0.1119 


0.0405 


2 


.757 


0. 


.0058 


PA 


0.3853 


0.0302 


12 


.767 


0. 


0000 


TO 


0.4905 


0.0798 


6 


.145 


0. 


0000 


NIO 


0.5994 


0.0592 


10 


.131 


0. 


.0000 


NIT 


0.5111 


0.0131 


3 


.886 


0. 


.0000 


YR06 


0.2560 


0.0351 


7 


.289 


0. 


.0000 


YR07 


0.6450 


0.0345 


18 


.645 


0. 


.0000 


SZ 


0.7221 


0.0474 


15 


.239 


0. 


0000 


log(FRVN) 


1.2041 


0.0137 


8 


.797 


0. 


.0000 



Table 4. Hedonic price function estimated via GLM - Model (2.1). 



Estimate Standard error z-statistic p-value 



(Intercept) 


-151.8019 


15.7792 


-9.620 


0.0000 


LON 


1.77e-05 


1.80e-06 


9.851 


0.0000 


log(AR) 


-0.2276 


0.0108 


-21.120 


0.0000 


UC 


0.1272 


0.0231 


5.515 


0.0000 


log(ST) 


0.2880 


0.0193 


14.954 


0.0000 


STR1 


0.3562 


0.0395 


9.021 


0.0000 


STR2 


0.1419 


0.0408 


3.482 


0.0005 


SI 


0.0945 


0.0255 


3.707 


0.0002 


PA 


0.2324 


0.0220 


10.556 


0.0000 


TO 


0.3139 


0.0503 


6.236 


0.0000 


NIO 


0.4208 


0.0348 


12.087 


0.0000 


NIT 


0.3779 


0.0642 


5.884 


0.0000 


YR06 


0.1947 


0.0242 


8.035 


0.0000 


YR07 


0.4551 


0.0242 


18.780 


0.0000 


SZ 


0.4716 


0.0310 


15.220 


0.0000 


log(FRVN) 


0.7467 


0.0622 


11.997 


0.0000 



relative to Model (3.4) - in the AIC, BIC and GD values (18822, 19212 and 18684, 
respectively) and that there is a better agreement between observed and predicted re- 
sponse values. 

Figure [3] contains plots of the smoothed curves from Model (3.5). The dashed lines 
are confidence bands based on pointwise standard errors. Panels (I), (II), (III), (IV), 
(V) and (VI) reveal that the effects/impacts of LAT, LON, log(AR), ST, UC and log(FRVN) 
are typically increasing, increasing/decreasingQdecreasing, increasing, increasing and 
increasing, respectively, with increases in latitude, longitude, log area, socioeconomic 
indicator, utilization coefficient and log land front in highly priced neighborhoods. 
Some of these effects were also suggested by the estimated coefficients of the CNLRM 



7 Panel (II) alternately shows local increasing and decreasing trends. 



REAL ESTATE APPRAISAL OF LAND LOTS USING GAMLSS MODELS 

Table 5. Fitted models via GAMLSS. 



12 



Model D Q Equation Considerations 

3.1 LOGNO logarithmic UP = /3o + cs(LAT) + cs(LON) + All regressors are significant at the 

cs(log(AR)) + cs(UC) + cs(ST) + level 1% significance level (z-tests). 
ySiSTRl + /3 2 STR2 + /3 3 SI + /J 4 PA + Also, AIC = 19155, BIC = 19359 
/3 5 TO+/36NIO+ J 07NIT+y3 8 YRO6+ and GD = 19083. Pseudo-fl 2 = 
/3 9 YR07 + /?iqSZ + cs(log(FRVN)) 0.739. 

3.2 IG logarithmic UP = /3 + cs(LAT) + cs(LON) + All regressors are significant at the 

cs(log(AR)) + cs(UC) + cs(ST) + 1% significance level (z-test). Also, 
ySiSTRl +/J2STR2+/33SI+/34PA+ AIC = 19845, BIC = 20048 and 
/3 5 TO + ySsNIO + /? 7 NIT + y3 8 YR06 + GD = 19773. Pseudo-R 2 = 0.678. 
/3 9 YR07 + /?iqSZ + cs(log(FRVN)) 

3.3 WEI logarithmic UP = /3o + cs(LAT) + cs(LON) + All regressors proved to be signifi- 

cs(log(AR)) + cs(UC) + cs(ST) + cant at the 1% significance level (z- 
ySiSTRl + /? 2 STR2 + /? 3 SI + /J 4 PA + tests). Also, AIC = 19260, BIC = 
/3 5 TO+/36NIO+/?7NIT+y3 8 YR06+ 19463 and GD = 19188. Pseudo- 
/3 9 YR07 + gipSZ + cs(log(FRVN)) R 2 = 0.748. 

3.4 GA logarithmic UP = f! + cs(LAT) + cs(LON) + All regressors are significant at the 

cs(log(AR)) + cs(UC) + cs(ST) + 1% significance level (z-tests). Also, 

ASTRl +yS 2 STR2+/3 3 SI+/3 4 PA+ AIC = 19062, BIC = 19337 and 

/3 5 TO + /3 6 NIO + /3 7 NIT + y3 8 YR06 + GD = 19134. Pseudo-K 2 = 0.746 
/? 9 YR07 + j8 10 SZ + cs(log(FRVN)) 



Table 6. Hedonic price function estimated via GAMLSS - Model (3.4). 





Estimative 


Standard error 


z-statistic 


p- value 


(Intercept) 


-165.4000 


16.1300 


-10.251 


0.0000 


cs(LAT) 


5.17e-05 


6.22e-06 


8.307 


0.0000 


cs(LON) 


1.51e-05 


2.13e-06 


7.071 


0.0000 


cs(log(AR)) 


-0.2317 


0.0096 


-24.074 


0.0000 


cs(ST) 


0.0465 


0.0037 


12.416 


0.0000 


cs(UC) 


0.1223 


0.0206 


5.947 


0.0000 


STR1 


0.3133 


0.0349 


8.963 


0.0000 


STR2 


0.0926 


0.0364 


2.545 


0.0100 


SI 


0.0920 


0.0227 


4.054 


0.0000 


PA 


0.1891 


0.0195 


9.670 


0.0000 


TO 


0.2662 


0.0474 


5.951 


0.0000 


NIO 


0.4135 


0.0395 


13.362 


0.0000 


NIT 


0.3485 


0.0571 


6.102 


0.0000 


YR06 


0.1645 


0.0215 


7.632 


0.0000 


YR07 


0.4358 


0.0215 


20.235 


0.0000 


cs(log(FRVN)) 


0.6513 


0.0569 


11.443 


0.0000 


SZ 


0.3875 


0.0299 


12.935 


0.0000 



and GLM models. Here, however, one obtains a somewhat more flexible global pic- 
ture, as we shall see. 

In panel (I), one notices that as the latitude increases the 'contribution' of the LAT 
covariate between the 702000 and 709000 latitudes (approximately) - neighborhoods 
that belong to the expansion zone of the city - is negative, whereas starting from po- 
sition 709000 (approximately) - South Zone and downtown area - the price effect is 
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Table 7. Hedonic price function estimated via GAMLSS - Model (3.5). 



Estimative Standard error z-statistic p-value 



(Interceptt) 
cs(LAT,df=10) 
cs(LON, df=10) 
cs(log(AR), df=10) 
cs(ST, df=8) 
cs(UC, df=3) 
STR1 
STR2 



130.1000 14.8100 -8.787 0.0000 

5.92e-05 5.71e-06 10.354 0.0000 

1.05e-05 1.96e-06 5.352 0.0000 

-0.2559 8.83e-03 -28.963 0.0000 

0.0373 3.44e-03 10.831 0.0000 

0.1769 0.0188 9.370 0.0000 

0.2571 0.0320 8.012 0.0000 

0.0728 0.0334 2.180 0.0293 

0.1029 0.0208 4.940 0.0000 

0.1436 0.0179 7.999 0.0000 

0.1822 0.0410 4.436 0.0000 

0.4173 0.0284 14.690 0.0000 

0.3388 0.0524 6.462 0.0000 

0.1373 0.0198 6.941 0.0000 

0.4190 0.0197 21.190 0.0000 

0.6599 0.0522 12.630 0.0000 

0.5119 0.0275 18.613 0.0000 



SI 
PA 
TO 
NIO 
NIT 



YR06 
YR07 

cs(log(FRVN),df=10) 
SZ 



positive. Additionally, we note that, in certain ranges, increases in latitude lead to dras- 
tic changes in the slope of the smoothed curve, e.g., between the 708000 and 710000 
positions, whereas in other areas, for instance between the 706000 and 708000 lat- 
itudes - the Mosqueiro neighborhood -, an increase in latitude leads to an uniform 
negative effect. 

Panel (II) shows that as longitude increases to position 8780000 the 'contribution' 
of the LON covariate is positive and nearly uniform, which almost exclusively covers 
observations relative to the Mosqueiro neighborhood. Starting at the 8785000 position 
there is a remarkable change in the slope of the fitted curve, which is triggered by the 
location of the most upper class neighborhoods: from 8785000 to 8794000. After the 
8794000 position, the effect remains positive, but is decreasing; it eventually becomes 
negative. 

We see in Panel (III) that as the area (in logs) increases the 'contribution' of the 
log(AR) covariate, for land lots with log areas between 4 and 5 (respectively), is clearly 
positive. The effect is negative for land lots with log areas in excess of 5. 

In Panel (IV), it is possible to notice that as we move up in the socioeconomic 
scale the 'contribution' of the ST covariate, in the range from 1 to 4 minimum wages, 
is negative, even though the there is an increasing trend. For land lots located in 
neighborhoods that correspond to more than 4 minimum wages, the effect is always 
positive; from 10 to 15 minimum wages the effect is uniform. 

We note from Panel (V) that, contrary to what one would expect, the 'contribution' 
of the UC covariate is not positive. In the range from 3.0 to 5.0, the fitted curve displays 
small oscillations, alternating in the postive and negative regions. The positive effect 
only holds for utilization coefficients greater than 5.0. 

Notice from Panel (VI) that as the front land lot (in logs) increases in highly priced 
neighborhoods the 'contribution' of the log(FRVN) covariate is mostly increasing and 
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Figure 3. Smoothed additive terms - Model (3.5). 



positive. However, in the 1.5 to 2.0 interval the positive effect is approximately uni- 
form. 

4.4. Comparing models. In order to compare the best estimated models via CNLRM 
(Model (1.4)), GLM (Model (2.1)) and GAMLSS (Model (3.5)) we shall use the AIC 
and BICf|We shall also compare the different models using the pseudo-7? 2 given in 
d4lT ). 

We present in Table [8] a comparative summary of the three models. We note that 
Model (3.5) is superior to the two competing models. Not only it has the smallest 



8 The criteria shall only be used to compare models that use the response (UP) in the same measure- 
ment scale: Models (2.1) and (3.5). 
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AIC and BIC values (in comparison to Model (2.1)), but it also has a much larger 
pesudo-/? 2 . The GAMLSS pseudo-/? 2 exceeds 0.80, which is notable. 

Table 8. Comparative summary of the CNLRM, GLM and GAMLSS 
estimated models. 



Model 


Class 


AIC 


BIC 


Pseudo-i? 2 


(1.4) 


(CNLRM) 


4290 


4392 


0.667 


(2.1) 


(GLM) 


19486 


19581 


0.672 


(3.5) 


(GAMLSS) 


18822 


19212 


0.811 



4.5. Dispersion parameter modeling (<t). After a suitable model for the prediction 
of fi was selected, we carried out a likelihood ratio test to determine whether the 
GAMLSS scale parameter cr is constant for all observations. The null hypothesis 
that cr is constant was rejected at the usual nominal levels. We then built a regres- 
sion model for such a parameter. To that end, we used stepwise covariate selection, 
considered different link functions (such as identity, inverse, reciprocal, etc.) and in- 
cluded smoothing functions (such as cubic splines, loess and penalized splines) in the 
linear predictor, just as we had done for the location parameter. We used the AIC 
for selecting the smoothers and for choosing the number of degrees of freedom of the 
smoothing functions together with visual inspection of the smoothed curves. 

We present in Table [9] the GAMLSS hedonic price function parameter estimates 
obtained by jointly modeling the location (p.) and dispersion (cr) effects; Model (3.6). 
The model uses the gamma distribution for the response and the log link function for 
both ft and cr. We note that Model (3.6) contains parametric and nonparametric terms, 
and for that reason it is said to be a linear additive semiparametric GAMLSS. 

We note from Table|9]that the parameter estimates of the location submodel in Model 
(3.6) are similar to the corresponding estimates from Model (3.5), in which cr was 
taken to be constant; see Table[7} It is noteworthy, nonetheless, that there was a sizeable 
reduction in the AIC, BIC and GD values (18607, 19065 and 18445, respectively) and 
also an improvement in the residuals as evidenced by the worm plot; see Figures [4] and 

s 

Only two covariates were selected for the cr regression submodel in Model (3.6), 
namely: ST and log(AR). The former (ST) entered the model in the usual parametric 
fashion whereas the latter (log(AR)) entered the model nonparametrically through a 
cubic spline smoothing function with ten effective degrees of freedom. We note that 
the positive sign of the log(AR) coefficient indicates that the UP dispersion is larger for 
land lots with larger areas whereas the negative sign of the ST coefficient indicates that 
the dispersion is inversely related to the socioeconomic neighborhood indicator. 

It is noteworthy that the pseudo-/? 2 of Model (3.6) is quite high (0.817) and that all 
of explanatory variables are statistically significant at the 1% nominal level which is 
not all that common in large sample cross sectional analyses, especially in real estate 
appraisals. Overall, the variable dispersion GAMLSS model is clearly superior to the 
alternative models. The good fit of Model (3.6) can be seen in Figure [6] where we plot 
the observed response values against the predicted values from the estimated model. 
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yu Coefficients 





Estimative 


Standard error 


z-statistic 


p- value 


(Intercept) 


-95.1300 


14.2700 


-6.665 


0.0000 


cs(LAT, df=10) 


5.94e-05 


5.37e-06 


11.053 


0.0000 


cs(LON, df=10) 


6.45e-06 


1.86e-06 


3.460 


0.0000 


cs(log(AR), df=10) 


-0.2087 


0.0104 


-20.138 


0.0000 


cs(ST, df=8) 


0.0321 


0.0030 


10.666 


0.0000 


cs(UC, df=3) 


0.2095 


0.0161 


13.006 


0.0000 


STR1 


0.2039 


0.0298 


6.838 


0.0000 


STR2 


0.0729 


0.0276 


2.635 


0.0084 


SI 


0.7136 


0.0192 


3.705 


0.0000 


PA 


0.1653 


0.0157 


10.465 


0.0000 


TO 


0.1778 


0.0370 


4.799 


0.0000 


NIO 


0.3722 


0.0251 


14.799 


0.0000 


NIT 


0.2790 


0.0468 


5.957 


0.0000 


YR06 


0.1255 


0.0175 


7.144 


0.0000 


YR07 


0.4195 


0.0177 


23.622 


0.00 


cs(log(FRVN), df=10) 


0.6809 


0.0403 


16.88 


0.0000 


SZ 


0.4824 


0.0241 


20.001 


0.0000 


a" Coefficients 


(Intercept) 


-1.6838 


0.0839 


-20.072 


0.0000 


cs(log(AR), df=10) 


0.1370 


0.0143 


9.593 


0.0000 


ST 


-0.0391 


0.0040 


-9.632 


0.0000 




Unit normal quantile 



Figure 4. Worm plot - Model (3.5). 



Note that the 45° line in this plot indicates perfect agreement between predicted and 
observed values. 
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Figure 5. Worm plot - Model (3.6). 




Figure 6. Observed values x predicted values of UP - Model (3.6). 

Model (3.6) is given by 
logOu) = A, + cs(LAT, df = 10) + cs(LON, df = 10) + cs(log(AR), df = 10)+ 
cs(UC, df = 3) + cs(ST, df = 8) + ^STRl + &STR2 + /3 3 SI+ 
yS 4 PA + yS 5 TO + y8 6 NIO + faNIT + y8 8 YR06 + j8 9 YR07 + j3 K >SZ+ 
cs(log(FRVN), df = 10), 
log(cr) = y + riST + cs(log(AR), df = 10), 
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in which the response (UP) follows a gamma distribution (GA) with location and scale 
parameters fx and cr, respectively. This model proved to be the best model for hedonic 
prices equation estimation of urban land lots in Aracaju. 

5. Concluding remarks 

Real state appraisal is usually performed using the standard linear regression model 
or the class of generalized linear models. In this paper, we introduced real state ap- 
praisal based on the class of generalized additive models for location, scale and shape, 
GAMLSS. Such a class of regression models provides a flexible framework for the 
estimation of hedonic price functions. It even allows for some conditioning variables 
to enter the model in a nonparametric fashion. The model also accomodates variable 
dispersion and can be based on a wide range of response distributions. Our empirical 
analysis was carried out using a large sample of land lots located in the city of Aracaju 
(Brazil). The selected GAMLSS model displayed a very high pseudo-i? 2 (approxi- 
mately 0.82) and yielded an excellent fit. Moreover, the inclusion of nonparametric 
additive terms in the model allowed for the estimation of the hedonic price function 
in a very flexible way. We showed that the GAMLSS fit was clearly superior to those 
based on the standard linear regression and on a generalized linear model. We strongly 
recommend the use of GAMLSS models for real state appraisal. 
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